#########
## Nature Human Behaviour 
## Leading Countries in Global Science Increasingly Receive More Citations than Other Countries Despite Doing Similar Research.
## https://doi.org/10.1038/s41562-022-01351-5
## Harvard Dataverse (Code and Metadata): https://doi.org/10.7910/DVN/WCOINR 
## Read Me File
#########

We provide the scripts, metadata, input files, etc. to reproduce the figures in our paper and in its supplemental information (SI) section. 

The data were pulled from a Microsoft Academic Graph (MAG) database set up and housed in AWS's Athena. MAG is no longer curated by Microsoft; its successor is the OpenAlex project. The code was mostly written in Python 3.6 and R. (The first script was written for and runs in Python 2.7.) The scripts are meant to be executed in order, from Step_X0A_* and Step_X0B_* through Step_X7_*. We have provided the files needed to produce the figures in the manuscript and SI section. We have also provided the metadata and corpora files needed to build the NLLDA models. The models themselves are not in the repository, but they can be produced with the scripts and data provided.
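
As a rough illustration of the ordering described above, the sketch below sorts script names by their Step_X prefix and guesses an interpreter from the "_Python2_", "_Python3_", or "_R_" token in each name. The executable names (python2.7, python3, Rscript) are assumptions about the local environment, not something the repository specifies.

```python
# Sketch only: order scripts by their Step_X prefix and pick an interpreter
# from the file name. The executable names are assumptions about your machine.
INTERPRETERS = {"_Python2_": "python2.7", "_Python3_": "python3", "_R_": "Rscript"}

def interpreter_for(script_name):
    """Return the assumed executable for a script, based on its file name."""
    for token, executable in INTERPRETERS.items():
        if token in script_name:
            return executable
    raise ValueError("cannot infer interpreter for " + script_name)

def execution_order(script_names):
    """Plain lexicographic sort: X0A < X0B < ... < X1A < ... < X7."""
    return sorted(script_names)

# A few script names from this Read Me, deliberately out of order.
scripts = [
    "Step_X4_R_RR_Plot_Figure_2.R",
    "Step_X0B_Python3_Athena_MAG_Field_RAKE_and_GoogleAPI_Corpora.py",
    "Step_X0A_Python2_Athena_MAG_Field_Metadata.py",
    "Step_X2_Python3_RR_MAG_KLD_MRQAP_and_Distortion_Network.py",
]
for name in execution_order(scripts):
    print(interpreter_for(name), name)
```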

Python scripts generally take two parameters on the command line: the discipline ID provided by MAG and the language restriction, either 'english_only' or 'all'. The main results only consider papers whose abstracts are in English (i.e., 'english_only'). The other condition, 'all', also includes papers that were translated using Google Translate's API and is used to build the figures in the SI. Thus, there are two sets of files, one for each condition.
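
The two-parameter convention above could be checked along the following lines. This is a minimal sketch, not the parsing code used in the scripts: the positional order (discipline first, then language) and the helper name parse_cli are assumptions, and 12345 is a placeholder rather than a real MAG discipline ID.

```python
# The two language conditions named in this Read Me.
LANGUAGES = ("english_only", "all")

def parse_cli(argv):
    """Parse [script, discipline_id, language]; the positional order is an assumption."""
    if len(argv) != 3:
        raise SystemExit("usage: python3 <script.py> <discipline_id> <english_only|all>")
    discipline, language = argv[1], argv[2]
    if language not in LANGUAGES:
        raise SystemExit("language must be one of: " + ", ".join(LANGUAGES))
    return discipline, language

# Equivalent to running: python3 <script.py> 12345 english_only
# (12345 is a placeholder, not a real discipline ID.)
print(parse_cli(["Step_X2_Python3_RR_MAG_KLD_MRQAP_and_Distortion_Network.py",
                 "12345", "english_only"]))
```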


INPUT_MAG_FieldIDs_Sherlock_Constraints.csv contains the discipline IDs and their names, as well as (over)estimates of their memory and time constraints for the most time-intensive scripts. 

Step_X2_Python3_RR_MAG_KLD_MRQAP_and_Distortion_Network.py is the "core" script for the paper that runs the modeling and analysis. 
 
=== Part I: Metadata Preparation ===
- Step_X0A_Python2_Athena_MAG_Field_Metadata.py
Creates the metadata used by the subsequent scripts. 
Input Parameter: Discipline
Input Parameter: Language
Output File: "OUTPUT_Python_MAG_Field_MetaData_Dict_"+str(discipline)+".pbz2"
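
The .pbz2 files can probably be opened as shown below. This sketch assumes that .pbz2 denotes a bz2-compressed pickle, which the extension suggests but this Read Me does not state. Because the metadata script runs in Python 2.7 and the later scripts in Python 3, passing encoding="latin1" to pickle.load may be needed if the default fails. The helper names and the toy dictionary are made up for illustration.

```python
import bz2
import os
import pickle
import tempfile

def save_pbz2(obj, path):
    """Write obj as a bz2-compressed pickle (the assumed meaning of .pbz2)."""
    with bz2.BZ2File(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=2)  # protocol 2 is readable by Python 2.7 and 3

def load_pbz2(path, encoding="ASCII"):
    """Read a bz2-compressed pickle; try encoding='latin1' for Python 2 pickles."""
    with bz2.BZ2File(path, "rb") as handle:
        return pickle.load(handle, encoding=encoding)

# Round trip on a toy dictionary; keys and values are invented, and 12345 is a
# placeholder discipline ID.
toy = {"paper_0": {"year": 2005, "country": "XX"}}
path = os.path.join(tempfile.mkdtemp(), "OUTPUT_Python_MAG_Field_MetaData_Dict_12345.pbz2")
save_pbz2(toy, path)
print(load_pbz2(path) == toy)
```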

- Step_X0B_Python3_Athena_MAG_Field_RAKE_and_GoogleAPI_Corpora.py
Extracts and creates the abstract corpora for the NLLDA models (and translates abstracts for the 'all' condition). 
Input Parameter: Discipline
Input Parameter: Language
Output File: "OUTPUT_Python_MAG_Field_Corpus_RAKE_and_GoogleAPI_"+str(discipline)+".pbz2"
Output File: "TEMP_English_OUTPUT_Python_MAG_Field_Corpus_RAKE_and_GoogleAPI_"+str(discipline)+".pbz2"
Output File: "TEMP_NonEnglish_OUTPUT_Python_MAG_Field_Corpus_RAKE_and_GoogleAPI_"+str(discipline)+".pbz2"
Note: Some of the TEMP_* files are missing. This only affects the Supplemental Information section. 

- Step_X0C_Python3_MAG_Yearly_RAKE_and_GoogleAPI_NLLDA.py
Creates the NLLDA models based on the input corpora and metadata for each field. 
Input Parameter: Discipline
Input Parameter: Language
Input File: "OUTPUT_Python_MAG_Field_MetaData_Dict_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG_Field_Corpus_RAKE_and_GoogleAPI_"+str(discipline)+".pbz2"
Input File: "TEMP_English_OUTPUT_Python_MAG_Field_Corpus_RAKE_and_GoogleAPI_"+str(discipline)+".pbz2"
Note: The output files are not in the repository, but creating these files can be done with the provided code and metadata files. 
Output File: "OUTPUT_Python_MAG_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".pbz2"
Output File: "OUTPUT_Python_MAG_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".pbz2"
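
The two output names above differ only in the language suffix and the discipline ID. A small helper like the one below can rebuild them. It is a sketch: the mapping from the command-line values ('english_only', 'all') to the suffixes (EnglishOnly, All) is inferred from the file names in this Read Me, the function name is hypothetical, and 12345 is a placeholder ID.

```python
# Inferred mapping from command-line language value to file-name suffix.
LANGUAGE_SUFFIX = {"english_only": "EnglishOnly", "all": "All"}

def nllda_model_filename(discipline, language):
    """Build the yearly NLLDA model file name for one discipline and language."""
    return ("OUTPUT_Python_MAG_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_"
            + LANGUAGE_SUFFIX[language] + "_" + str(discipline) + ".pbz2")

# 12345 is a placeholder, not a real MAG discipline ID.
for language in ("english_only", "all"):
    print(nllda_model_filename(12345, language))
```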

- Step_X0D_Python3_MAG_NLLDA_Topic_Coherence_Scores.py
Creates the topic coherence scores for each NLLDA model. 
Input Parameter: Discipline
Input Parameter: Language
Input File: "OUTPUT_Python_MAG_Yearly_NLLDA_Topic_Coherence_Scores_EnglishOnly_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG_Yearly_NLLDA_Topic_Coherence_Scores_All_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".pbz2"
Output File: "OUTPUT_Python_MAG_Yearly_NLLDA_Topic_Coherence_Scores_EnglishOnly_"+str(discipline)+".pbz2"
Output File: "OUTPUT_Python_MAG_Yearly_NLLDA_Topic_Coherence_Scores_All_"+str(discipline)+".pbz2"

- Step_X0E_Python3_RR_MAG_Journal_Coverage.py
Creates a census of journal coverage in each field in MAG to censor for journals that have existed since 1980. 
Output File: "OUTPUT_Python_MAG_Journal_Coverage_From_1980_to_1990.csv.gz"
Output File: "OUTPUT_Python_MAG_Journal_Coverage_From_1990_to_2000.csv.gz"
Output File: "OUTPUT_Python_MAG_Journal_Coverage_From_2000_to_2010.csv.gz"
Output File: "OUTPUT_Python_MAG_Journal_Coverage_From_2010_to_2017.csv.gz"

- Step_X0F_Python3_RR_MAG_Journal_Censored_Yearly_RAKE_and_GoogleAPI_NLLDA.py
Creates a journal-censored NLLDA model based on the journal census. 
Input Parameter: Discipline
Input Parameter: Language
Input File: "OUTPUT_Python_MAG_Journal_Coverage_From_1980_to_1990.csv.gz"
Input File: "OUTPUT_Python_MAG_Journal_Coverage_From_1990_to_2000.csv.gz"
Input File: "OUTPUT_Python_MAG_Journal_Coverage_From_2000_to_2010.csv.gz"
Input File: "OUTPUT_Python_MAG_Journal_Coverage_From_2010_to_2017.csv.gz"
Input File: "OUTPUT_Python_MAG_Field_Corpus_RAKE_and_GoogleAPI_"+str(discipline)+".pbz2"
Input File: "TEMP_English_OUTPUT_Python_MAG_Field_Corpus_RAKE_and_GoogleAPI_"+str(discipline)+".pbz2"
Input File: "addresses.csv"
Output File: "OUTPUT_Python_MAG_Journal_Censored_"+str(journal_censor_)+"_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".pbz2"
Output File: "OUTPUT_Python_MAG_Journal_Censored_"+str(journal_censor_)+"_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".pbz2"

- Step_X0G_Python3_RR_MAG_Journal_Censored_NLLDA_Topic_Coherence_Scores.py
Creates the topic coherence scores for the journal-censored NLLDA model. 
Input Parameter: Discipline
Input Parameter: Language
Input File: "OUTPUT_Python_MAG_Journal_Censored_"+str(journal_censoring)+"_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG_Journal_Censored_"+str(journal_censoring)+"_Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".pbz2"
Output File: "OUTPUT_Python_MAG_Journal_Censored_"+str(journal_censoring)+"_Yearly_NLLDA_Topic_Coherence_Scores_EnglishOnly_"+str(discipline)+".pbz2"
Output File: "OUTPUT_Python_MAG_Journal_Censored_"+str(journal_censoring)+"_Yearly_NLLDA_Topic_Coherence_Scores_All_"+str(discipline)+".pbz2"

=== Part II: Modeling and Analysis ===
- Step_X1A_Python3_RR_MAG_Create_Nation_Label_to_Published_Year_Citation.py
Creates the citation network for each discipline. 
Input Parameter: Discipline
Input File: "addresses.csv"
Output File: "OUTPUT_Python_MAG_RR_Citation_Country_by_Year_"+str(discipline)+".csv.gz"
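
The .csv.gz outputs are gzip-compressed CSV files and can be read with the Python standard library alone. The sketch below writes and reads back a toy file. The column names (citing_country, cited_country, year, citations) are hypothetical, since this Read Me does not list the actual columns, and 12345 is a placeholder discipline ID.

```python
import csv
import gzip
import os
import tempfile

def write_csv_gz(rows, fieldnames, path):
    """Write a list of dicts to a gzip-compressed CSV file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

def read_csv_gz(path):
    """Read a gzip-compressed CSV file into a list of dicts (all values are strings)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))

# Toy row with hypothetical column names, for illustration only.
columns = ["citing_country", "cited_country", "year", "citations"]
rows = [{"citing_country": "AA", "cited_country": "BB", "year": "2005", "citations": "3"}]
path = os.path.join(tempfile.mkdtemp(),
                    "OUTPUT_Python_MAG_RR_Citation_Country_by_Year_12345.csv.gz")
write_csv_gz(rows, columns, path)
print(read_csv_gz(path))
```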

- Step_X1B_Python3_RR_MAG_Create_Journal_Censored_Nation_Label_to_Published_Year_Citation.py
Creates the citation network for each discipline using journal censoring. 
Input Parameter: Discipline
Input File: "addresses.csv"
Input File: "OUTPUT_Python_MAG_Journal_Coverage_From_1980_to_1990.csv.gz"
Input File: "OUTPUT_Python_MAG_Journal_Coverage_From_1990_to_2000.csv.gz"
Input File: "OUTPUT_Python_MAG_Journal_Coverage_From_2000_to_2010.csv.gz"
Input File: "OUTPUT_Python_MAG_Journal_Coverage_From_2010_to_2017.csv.gz"
Output File: "OUTPUT_Python_MAG_RR_Journal_Censored_Citation_Country_by_Year_"+str(discipline)+".csv.gz"

- Step_X2_Python3_RR_MAG_KLD_MRQAP_and_Distortion_Network.py
This script is the "core" of our modeling. We create the "citational well" and run the QAP models. 
Input Parameter: Discipline
Input Parameter: Language
Input File: "addresses.csv"
Input File: "OUTPUT_Python_MAG_Field_MetaData_Dict_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG_RR_Citation_Country_by_Year_"+str(discipline)+".csv.gz"
Input File: "OUTPUT_Python_MAG_RR_Journal_Censored_Citation_Country_by_Year_"+str(discipline)+".csv.gz"
Input File: "OUTPUT_Python_MAG"+str(journal_censored_filename)+"Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG"+str(journal_censored_filename)+"Yearly_NLLDA_Topic_Coherence_Scores_EnglishOnly_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG"+str(journal_censored_filename)+"Yearly_NLLDA_Dict_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".pbz2"
Input File: "OUTPUT_Python_MAG"+str(journal_censored_filename)+"Yearly_NLLDA_Topic_Coherence_Scores_All_"+str(discipline)+".pbz2"
Output File: "OUTPUT_Python_Field_MRQAP_Beta_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_Field_KLD_Citation_Edgelist_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_Field_Delta_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_Field_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_MAG"+str(journal_censored_filename)+"International_KLD_Edgelist_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_Field_MRQAP_Beta_KLD_Citation_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_Field_KLD_Citation_Edgelist_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_Field_Delta_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_Field_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".csv.gz"
Output File: "OUTPUT_Python_MAG"+str(journal_censored_filename)+"International_KLD_Edgelist_Corpus_RAKE_and_GoogleAPI_All_"+str(discipline)+".csv.gz"

- Step_X3_Python3_RR_MAG_Country_Vignette.py
Combines the citational well flat files into a single file. 
Input File: "INPUT_R_RR_Nation_to_Regional_Classification.csv"
Input File(s): "OUTPUT_Python_Field_KLD_Citation_Edgelist_Corpus_RAKE_and_GoogleAPI_EnglishOnly_*"
Output File: "OUTPUT_Python_MAG_RR_Delta_Regional_Vignette.csv.gz"
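
The wildcard in the "Input File(s)" entry above means one file per discipline. Gathering such files could look like the sketch below, which is not the logic of Step_X3 itself: it only concatenates rows and records the source file. The column names and the two field IDs (11111, 22222) are made up for the demonstration.

```python
import csv
import glob
import gzip
import os
import tempfile

PREFIX = "OUTPUT_Python_Field_KLD_Citation_Edgelist_Corpus_RAKE_and_GoogleAPI_EnglishOnly_"

def combine_csv_gz(pattern):
    """Concatenate rows from every file matching pattern, tagging each row with its file."""
    combined = []
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = os.path.basename(path)
                combined.append(row)
    return combined

# Toy demonstration: two invented field files with hypothetical columns.
folder = tempfile.mkdtemp()
for field_id, value in (("11111", "0.1"), ("22222", "0.2")):
    toy_path = os.path.join(folder, PREFIX + field_id + ".csv.gz")
    with gzip.open(toy_path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "target", "weight"])
        writer.writerow(["AA", "BB", value])

combined = combine_csv_gz(os.path.join(folder, PREFIX + "*"))
print(len(combined), [row["source_file"] for row in combined])
```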

=== Part III: Figures and Plots ===
- Step_X4_R_RR_Plot_Figure_2.R
Input File: "INPUT_R_Nation_to_Region_Core.csv"
Input File(s): "OUTPUT_Python_Field_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_*"
Input File(s): "OUTPUT_Python_Field_MRQAP_Beta_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_*"
Output File: "NEW_FIGURE_R_Plot2.pdf"

- Step_X5_R_RR_Plot_Figure_2_Supplemental_Information.R
Input File: "INPUT_MAG_FieldIDs_Domain.csv"
Input File: "INPUT_R_Nation_to_Region_Core.csv"
Input File(s): "OUTPUT_Python_Field_MRQAP_Beta_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_*"
Input File(s): "OUTPUT_Python_Field_MRQAP_Beta_KLD_Citation_Corpus_RAKE_and_GoogleAPI_All_*"
Nation-Labeled Cohesion Score
Output File: "SM_Figure_Coherence_Cutoffs_Num_Countries.pdf"
Output File: "SM_Figure_Coherence_Cutoffs_Beta_Coefficients.pdf"
Output File: "SM_Figure_Coherence_Cutoffs_Beta_Coefficients_DOMAIN.pdf"
Citation Deflation
Output File: "SM_Figure_Deflated_Beta_Coefficients.pdf"
Output File: "SM_Figure_Deflated_Beta_Coefficients_DOMAIN.pdf"
Core, Periphery, and Core+Periphery Country Inclusion in the QAP Models
Output File: "SM_Figure_CorePeriphery_Beta_Coefficients.pdf"
Output File: "SM_Figure_CorePeriphery_Beta_Coefficients_DOMAIN.pdf"
Language Censoring
Output File: "SM_Figure_Language_Beta_Coefficients.pdf"
Output File: "SM_Figure_Language_Beta_Coefficients_DOMAIN.pdf"
Journal Censoring
Output File: "SM_Figure_JournalCensored_Beta_Coefficients.pdf"
Output File: "SM_Figure_JournalCensored_Beta_Coefficients_DOMAIN.pdf"
Pooling Statistically Significant and Not Significant Beta Coefficients in the QAP Models
Output File: "SM_Figure_Significant_Beta_Coefficients.pdf"
Output File: "SM_Figure_Significant_Beta_Coefficients_DOMAIN.pdf"

- Step_X6_R_RR_Plot_Figures_3_thru_8.R
Input File: "INPUT_MAG_FieldIDs_Domain.csv"
Input File: "INPUT_R_Nation_to_Region_Core.csv"
Input File(s): "OUTPUT_Python_Field_Delta_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_*"
Input File(s): "OUTPUT_Python_Field_Delta_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_All_*"
Output File: "NEW_FIGURE_R_Plot3.pdf"
Output File: "NEW_FIGURE_R_Plot4.pdf"
Output File: "NEW_FIGURE_R_Plot5.pdf"
Output File: "NEW_FIGURE_R_Plot6.pdf"
Output File: "NEW_FIGURE_R_Plot7.pdf"
Output File: "NEW_FIGURE_R_Plot8.pdf"
Nation-Labeled Cohesion Score
Output File: "SM_Figure_Coherence_Cutoffs_Delta_2000_2012.pdf"
Output File: "SM_Figure_Coherence_Cutoffs_Delta_over_CountrySpecific.pdf"
Output File: "SM_Figure_Coherence_Cutoffs_Delta_over_CountrySpecific_DOMAIN.pdf"
Output File: "SM_Figure_Coherence_Cutoffs_Delta_over_CorePeriphery.pdf"
Output File: "SM_Figure_Coherence_Cutoffs_Delta_over_CorePeriphery_DOMAIN.pdf"
Citation Deflation
Output File: "SM_Figure_Deflated_Delta_over_CountrySpecific.pdf"
Output File: "SM_Figure_Deflated_Delta_over_CountrySpecific_DOMAIN.pdf"
Output File: "SM_Figure_Deflated_Delta_over_CorePeriphery.pdf"
Output File: "SM_Figure_Deflated_Delta_over_CorePeriphery_DOMAIN.pdf"
Journal Censoring
Output File: "SM_Figure_JournalCensored_over_CountrySpecific.pdf"
Output File: "SM_Figure_JournalCensored_over_CountrySpecific_Domain.pdf"
Output File: "SM_Figure_JournalCensored_over_CorePeriphery.pdf"
Output File: "SM_Figure_JournalCensored_over_CorePeriphery_Domain.pdf"
Language Censoring
Output File: "SM_Figure_Language_over_CountrySpecific.pdf"
Output File: "SM_Figure_Language_over_CountrySpecific_Domain.pdf"
Output File: "SM_Figure_Language_over_CorePeriphery.pdf"
Output File: "SM_Figure_Language_over_CorePeriphery_Domain.pdf"
First Appearance of Countries Censoring
Output File: "SM_Figure_FirstApperance_Impact_on_Delta_CorePeriphery.pdf"
Output File: "SM_Figure_FirstApperance_Impact_on_Delta_CorePeriphery_Domain.pdf"

=== Part IV: Regression (Supplemental Information) ===
- Step_X7_R_RR_Regression_Citational_Well_Unexplained_Variance.R
Input File: "INPUT_MAG_FieldIDs_Domain.csv"
Input File(s): "OUTPUT_Python_Field_Delta_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_*"
Input File(s): "OUTPUT_Python_Field_Degree_Centrality_KLD_Citation_Corpus_RAKE_and_GoogleAPI_EnglishOnly_*"
Hierarchical Linear Models (HLMs) 
Output File: "SM_Table_R_DV_Delta_HLM.html"
Output File: "SM_Table_R_DV_KLD_HLM.html"
